Incorporating linguistic post-processing into whole-book recognition

Authors

  • Pingping Xiu
  • Henry S. Baird
Abstract

We describe a technique for linguistic post-processing of whole-book recognition results. Whole-book recognition is a technique that improves recognition of book images using fully automatic cross-entropy-based model adaptation. In previously published work, word recognition was performed on each word separately, without taking into account passage-level information such as word-occurrence frequencies. As a result, some words that are rare in real texts may appear much more often in recognition results, and vice versa. Differences between word frequencies in recognition results and in prior knowledge may therefore indicate recognition errors over a long passage. In this paper, we propose a post-processing technique that enhances whole-book recognition results by minimizing the differences between word frequencies in the recognition results and prior word frequencies. This technique works better when operating on longer passages, and in a 90-page experiment it reduces the character error rate by about 20% in relative terms, from 1.24% to 0.98%.
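The core idea of the abstract, comparing word frequencies in the recognition output against prior word frequencies, can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not say how the difference is measured or minimized, so the Kullback-Leibler divergence used here, the function names, the `floor` and `ratio` parameters, and the toy vocabulary are all assumptions made for illustration only.

```python
import math
from collections import Counter


def relative_frequencies(words):
    """Map each distinct word to its share of the passage."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def frequency_divergence(recognized_words, prior, floor=1e-6):
    """KL divergence between observed and prior word frequencies.

    Assumption: KL divergence stands in for the unspecified
    "difference" in the abstract. Words missing from the prior
    are given the small probability `floor`.
    """
    observed = relative_frequencies(recognized_words)
    return sum(
        p * math.log(p / prior.get(w, floor)) for w, p in observed.items()
    )


def overrepresented(recognized_words, prior, ratio=5.0, floor=1e-6):
    """Words that occur at least `ratio` times more often than the prior predicts."""
    observed = relative_frequencies(recognized_words)
    return sorted(
        w for w, p in observed.items() if p / prior.get(w, floor) >= ratio
    )


# Toy data (hypothetical): "oat" is rare in the prior but frequent in the output,
# which hints that some occurrences may be misrecognitions of another word.
prior = {"the": 0.5, "cat": 0.3, "eat": 0.15, "oat": 0.05}
recognized = ["the", "oat", "the", "oat", "eat", "oat"]
corrected = ["the", "cat", "the", "cat", "eat", "oat"]

suspects = overrepresented(recognized, prior)
before = frequency_divergence(recognized, prior)
after = frequency_divergence(corrected, prior)
```

In this toy case the candidate correction lowers the divergence from the prior, which is the kind of signal a frequency-based post-processor could use when choosing among alternative readings of a word. The longer the passage, the more reliable the observed frequencies become, which is consistent with the abstract's remark that the technique works better on longer passages.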


Similar resources

Incorporating A Rich Linguistic Model into Whole-Book Recognition

Whole-book recognition, a technique that improves recognition of book images using fully automatic mutual-entropy-based model adaptation, has achieved a character error rate as low as 1.9% on 50 pages of real book images in our previous publications. However, the linguistic model for word recognition was simple, assuming a uniform distribution on the words in the dictionary, so that the algorithm ...


Incorporating linguistic knowledge and automatic baseform generation in acoustic subword unit based speech recognition

A major challenge in speech recognition based on acoustic subword units is creating a lexicon which is robust to inter- and intra-speaker variations. In this paper we present two different approaches for incorporating simple word-level linguistic knowledge into the labelling step of the training procedure. The proposed systems also utilise a scheme for combined optimisation of baseforms and subwor...


Multi-level post-processing for Korean character recognition using morphological analysis and linguistic evaluation

Most of the post-processing methods for character recognition rely on contextual information of character and word-fragment levels. However, due to linguistic characteristics of Korean, such low-level information alone is not sufficient for high-quality character-recognition applications, and we need much higher-level contextual information to improve the recognition results. This paper present...


Incorporating Cognitive Linguistic Insights into Classrooms: the Case of Iranian Learners’ Acquisition of If-Clauses

Cognitive linguistics gives the most inclusive, consistent description to date of how language is organized, used and learned. Cognitive linguistics contains a great number of concepts that are useful to second language learners. If-clauses in English, on the other hand, remain a struggle for foreign language learners, due to their intrinsic intricacies. EFL grammar books are ...


LAperLA: an integrated graphical-linguistic System for old printed Latin Texts

LAperLA (Lettore Automatico per Libri Antichi) is a prototype for the automatic recognition of Latin texts in old printed books. The strengths of the system are the neural architecture and the post-processing linguistic tool that is represented by an index of Latin forms (more than 500,000) and by a query management system which uses the information of the index to check and correct the interpr...



Publication date: 2010